In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))
The data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4 (see the module docstring of open_cp.sources.chicago). See also https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
In this notebook, we quickly look at the data, check that the data agrees between both sources, and demo some of the library features provided for loading the data.
In [2]:
import open_cp.sources.chicago as chicago
import geopandas as gpd
import sys, os, csv, lzma
filename = os.path.join("..", "..", "open_cp", "sources", "chicago.csv")
filename_all = os.path.join("..", "..", "open_cp", "sources", "chicago_all.csv.xz")
filename_all1 = os.path.join("..", "..", "open_cp", "sources", "chicago_all1.csv.xz")
Let us compare the snapshot of the last year with the full dataset. The data appears to be the same, though the exact format differs.
In [3]:
with open(filename, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))
In [4]:
with lzma.open(filename_all, "rt") as file:
    reader = csv.reader(file)
    print(next(reader))
    print(next(reader))
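To make the format comparison concrete, here is a minimal sketch that normalises two header rows and reports which column names appear in only one of them. The CSV text below is made-up illustrative data, not the real headers, and `normalise` and `compare_headers` are hypothetical helpers that are not part of open_cp.

```python
import csv
import io

def normalise(name):
    """Lower-case a column name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def compare_headers(header_a, header_b):
    """Return (only_in_a, only_in_b, shared) as sorted lists of normalised names."""
    a = {normalise(h) for h in header_a}
    b = {normalise(h) for h in header_b}
    return sorted(a - b), sorted(b - a), sorted(a & b)

# Illustrative stand-ins for the two files (assumed contents, not real data).
snapshot = io.StringIO("CASE#,DATE  OF OCCURRENCE,BLOCK\nX1,01/01/2017,001XX\n")
full = io.StringIO("ID,Case Number,Date,Block\n1,X1,01/01/2017,001XX\n")

header_a = next(csv.reader(snapshot))
header_b = next(csv.reader(full))
only_a, only_b, shared = compare_headers(header_a, header_b)
print("Only in snapshot:", only_a)
print("Only in full:    ", only_b)
print("Shared:          ", shared)
```

With the real files, the same helper could be fed the first row read in each of the two cells above.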
As well as loading data directly into a TimedPoints class, we can process a sub-set of the data to GeoJSON, or straight to a geopandas dataframe (if geopandas is installed).
In [5]:
geo_data = chicago.load_to_GeoJSON()
geo_data[0]
Out[5]:
In [6]:
frame = chicago.load_to_geoDataFrame()
frame.head()
Out[6]:
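For readers unfamiliar with GeoJSON, the following sketch shows the general shape of a point Feature. It is not how open_cp builds its features: `row_to_feature` is a hypothetical helper, the values are invented, and the real output of `load_to_GeoJSON` may carry more properties than the three shown here (the property names `case`, `timestamp` and `crime` are the ones used later in this notebook).

```python
import json

def row_to_feature(case, timestamp, crime, lon, lat):
    """Build a GeoJSON point Feature; GeoJSON orders coordinates as [lon, lat]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"case": case, "timestamp": timestamp, "crime": crime},
    }

# Invented example values, for illustration only.
feature = row_to_feature("X000001", "2001-01-01T12:00:00", "THEFT", -87.6298, 41.8781)
print(json.dumps(feature, indent=2))
```

A list of such dictionaries is the kind of input that `gpd.GeoDataFrame.from_features` accepts, which is how the later cells build their frames.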
We can save the dataframe to a shape-file which can be viewed in e.g. QGIS.
To explore the spatial distribution, I would recommend using an interactive GIS package. Using QGIS (free and open source) you can easily add a basemap using GoogleMaps or OpenStreetMap, etc. See http://maps.cga.harvard.edu/qgis/wkshop/basemap.php
I found this to be slightly buggy. On Windows with QGIS 2.18.7, I found that the following worked:
chicago.shp file produced from the line above.
In [7]:
# On my Windows install, if I don't do this, I get a GDAL error in
# the Jupyter console, and the resulting ".prj" file is empty.
# This isn't critical, but it confuses QGIS, and you end up having to
# choose a projection when loading the shape-file.
import os
os.environ["GDAL_DATA"] = "C:\\Users\\Matthew\\Anaconda3\\Library\\share\\gdal\\"
frame.to_file("chicago")
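The hard-coded path above is specific to one machine. A slightly more defensive sketch, assuming you know a candidate directory for your own install, only sets `GDAL_DATA` when it is not already set and the directory actually exists. `ensure_gdal_data` is a hypothetical helper, not part of open_cp or GDAL.

```python
import os

def ensure_gdal_data(candidate_dir, environ=os.environ):
    """Set GDAL_DATA to candidate_dir if it is unset and the directory exists.

    Returns the value in effect afterwards, or None if nothing could be set.
    """
    current = environ.get("GDAL_DATA")
    if current:
        return current  # respect an existing setting
    if os.path.isdir(candidate_dir):
        environ["GDAL_DATA"] = candidate_dir
        return candidate_dir
    return None

# Demonstration on a plain dict so the real environment is left untouched.
demo_env = {}
print(ensure_gdal_data(os.getcwd(), demo_env))
```

In the notebook you would call it with your own GDAL share directory and the default `environ`, before `frame.to_file(...)`.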
In [8]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
        if event["properties"]["crime"] == "THEFT" ]
frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()
Out[8]:
In [9]:
frame.to_file("chicago_all_theft")
In [10]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
        if event["properties"]["crime"] == "BURGLARY" ]
frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()
Out[10]:
In [11]:
frame.to_file("chicago_all_burglary")
In [12]:
frame["type"].unique()
Out[12]:
In [13]:
frame["location"].unique()
Out[13]:
Upon loading into QGIS to visualise, we find that the 2001 data seems to be geocoded in a different way... The events are not on the road, and the distribution looks less artificial. Let's extract the 2001 burglary data, and then all the 2001 data, and save.
In [14]:
with lzma.open(filename_all, "rt") as file:
    features = [ event for event in chicago.generate_GeoJSON_Features(file, type="all")
        if event["properties"]["timestamp"].startswith("2001") ]
frame = gpd.GeoDataFrame.from_features(features)
frame.crs = {"init":"EPSG:4326"} # Lon/Lat native coords
frame.head()
Out[14]:
In [15]:
frame.to_file("chicago_2001")
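The three extraction cells above repeat the same pattern and change only the filter. As a sketch, the filtering step could be factored into a small generator that takes a predicate on the `properties` dictionary. `features_matching` is a hypothetical helper, and the sample features are invented stand-ins for what `chicago.generate_GeoJSON_Features` yields.

```python
def features_matching(features, predicate):
    """Lazily yield the features whose "properties" dict satisfies predicate."""
    for feature in features:
        if predicate(feature["properties"]):
            yield feature

# Invented sample features using the property names seen in this notebook.
sample = [
    {"type": "Feature", "geometry": None,
     "properties": {"case": "A1", "crime": "THEFT", "timestamp": "2001-03-04T10:00:00"}},
    {"type": "Feature", "geometry": None,
     "properties": {"case": "A2", "crime": "BURGLARY", "timestamp": "2001-07-01T22:30:00"}},
    {"type": "Feature", "geometry": None,
     "properties": {"case": "A3", "crime": "THEFT", "timestamp": "2016-01-15T08:15:00"}},
]

thefts = list(features_matching(sample, lambda p: p["crime"] == "THEFT"))
from_2001 = list(features_matching(sample, lambda p: p["timestamp"].startswith("2001")))
print(len(thefts), len(from_2001))
```

Because the helper is a generator, it can wrap the feature stream from the compressed file without holding the unfiltered data in memory.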
We check the following:
In the other notebook, we look at map projections. The data is most consistent with the longitude / latitude coordinates being the primary source, and the X/Y projected coordinates being computed and rounded to the nearest integer.
In [16]:
longs, lats = [], []
xcs, ycs = [], []
with open(filename, "rt") as file:
    reader = csv.reader(file)
    header = next(reader)
    print(header)
    for row in reader:
        if len(row[14]) > 0:
            longs.append(row[14])
            lats.append(row[15])
            xcs.append(row[12])
            ycs.append(row[13])
In [17]:
set(len(x) for x in longs), set(len(x) for x in lats)
Out[17]:
In [18]:
any(x.find('.') >= 0 for x in xcs), any(y.find('.') >= 0 for y in ycs)
Out[18]:
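The two checks above look at string length and at the presence of a decimal point. A slightly more direct sketch counts the digits after the decimal point in each coordinate string. `decimal_places` is a hypothetical helper, and the sample strings are invented rather than taken from the dataset.

```python
def decimal_places(text):
    """Number of characters after the decimal point, or 0 if there is none."""
    _, sep, frac = text.partition(".")
    return len(frac) if sep else 0

# Invented sample values standing in for the CSV columns.
sample_lons = ["-87.629800123", "-87.700000001"]
sample_xs = ["1176352", "1165074"]

lon_places = {decimal_places(s) for s in sample_lons}
x_places = {decimal_places(s) for s in sample_xs}
print(lon_places, x_places)
```

A single value in the first set would suggest a fixed output precision for longitude and latitude, and `{0}` for the second set is consistent with the projected coordinates having been rounded to integers.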
In [ ]:
import collections
with lzma.open(filename_all, "rt") as file:
    c = collections.Counter( event["properties"]["case"] for event in
        chicago.generate_GeoJSON_Features(file, type="all") )
multiples = set( key for key in c if c[key] > 1 )
len(multiples)
In [ ]:
# Collect every record whose case number occurs more than once
with lzma.open(filename_all, "rt") as file:
    data = gpd.GeoDataFrame.from_features(
        event for event in chicago.generate_GeoJSON_Features(file, type="all")
        if event["properties"]["case"] in multiples
        )
len(data), len(data.case.unique())